Decision making and business intelligence
7–14
Nikolay I. Golov - Lecturer, Department of Business Analytics, School of Business Informatics, Faculty of Business and Management, National Research University Higher School of Economics. Address: 20, Myasnitskaya Street, Moscow, 101000, Russian Federation. E-mail: ngolov@hse.ru
Lars Ronnback - Lecturer, Department of Computer Science, Stockholm University. Address: SE-106 91 Stockholm, Sweden. E-mail: lars.ronnback@anchormodeling.com
This paper describes an approach for fast ad-hoc analysis of Big Data inside a relational data model. The approach strives to achieve maximal utilization of highly normalized temporary tables through the merge join algorithm. It is designed for the Anchor modeling technique, which requires a very high level of table normalization. Anchor modeling is a novel data warehouse modeling technique, designed for classical databases and adapted by the authors for a Big Data environment and a massively parallel processing (MPP) database. Anchor modeling provides flexibility and a high speed of data loading, and the presented approach adds support for fast ad-hoc analysis of Big Data sets (tens of terabytes). Different approaches to query plan optimization are described and estimated for row-based and column-based databases. Theoretical estimates and the results of real data experiments carried out in a column-based MPP environment (HP Vertica) are presented and compared. The results show that the approach is particularly favorable when the available RAM resources are scarce, so that ad-hoc queries switch from pure in-memory processing to spilling over to hard disk. Scaling is also investigated by running the same analysis on different numbers of nodes in the MPP cluster. Configurations of five, ten and twelve nodes were tested, using clickstream data of Avito, the largest classified advertisements site in Russia.
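The centerpiece of the approach, joining highly normalized, co-sorted tables with a merge join instead of a hash join, can be illustrated outside any particular database. Below is a minimal sketch in Python, assuming two anchor-style attribute tables already pre-sorted by a shared surrogate key; the table contents and names are invented for illustration:

```python
def merge_join(left, right):
    """Merge join two lists of (key, value) pairs, each pre-sorted by key.

    Unlike a hash join, this requires no in-memory hash table, which is
    why it degrades gracefully when RAM is scarce and data spill to disk.
    """
    i, j, result = 0, 0, []
    while i < len(left) and j < len(right):
        (lk, lv), (rk, rv) = left[i], right[j]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            result.append((lk, lv, rv))
            i += 1
            j += 1
    return result

# Two anchor-style attribute tables, sorted by the surrogate key.
user_age = [(1, 25), (2, 31), (4, 47)]
user_city = [(1, "Moscow"), (3, "Kazan"), (4, "Samara")]
print(merge_join(user_age, user_city))  # [(1, 25, 'Moscow'), (4, 47, 'Samara')]
```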
15–23
Alexey A. Masyutin - Post-graduate Student, School of Data Analysis and Artificial Intelligence, Faculty of Computer Science, National Research University Higher School of Economics. Address: 20, Myasnitskaya Street, Moscow, 101000, Russian Federation. E-mail: alexey.masyutin@gmail.com
Social networks accumulate huge amounts of information, which can provide valuable insights into people's behavior. In this paper, we use social data from Vkontakte, Russia's most popular social network, to discriminate between solvent and delinquent debtors of credit organizations. Firstly, we present the datacenter architecture for social data retrieval. It has several functions, such as client matching, user profile parsing, API communication and data storage. Secondly, we develop two credit scorecards based exclusively on social data. The first scorecard uses the classical default definition: 90 days' delinquency within 12 months of loan origination. The second scorecard uses the classical fraud definition: falling into default within the first 3 months. Both scorecards apply a WOE (weight of evidence) transformation to the input data and then run logistic regression. The findings are as follows: social data predict fraudulent cases better than ordinary defaults, and social data may be used to enrich classical application scorecards. The performance of the scorecards is at an acceptable level, even though the input data came exclusively from the social network. Since credit history (which usually serves as input data in classical scorecards) is often not rich enough for young clients, we find that social data can add value to the performance of scoring systems. The paper will be of interest to banks and microfinance organizations.
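The WOE step mentioned above has a standard definition in credit scoring: each bin of an input variable is replaced by the logarithm of the ratio between the shares of solvent and delinquent clients falling into it, and logistic regression is then fitted on the transformed variables. A minimal sketch under that standard definition (the feature, binning and figures are invented):

```python
import math

def woe_table(bins, target):
    """Weight of evidence per bin: ln(share of goods / share of bads).

    The additive constant 0.5 is one common convention guarding
    against bins that contain no goods or no bads.
    """
    goods = sum(1 for t in target if t == 0)
    bads = sum(1 for t in target if t == 1)
    woe = {}
    for b in set(bins):
        g = sum(1 for x, t in zip(bins, target) if x == b and t == 0)
        d = sum(1 for x, t in zip(bins, target) if x == b and t == 1)
        woe[b] = math.log(((g + 0.5) / goods) / ((d + 0.5) / bads))
    return woe

# Binned social feature (e.g. number of friends) and default flag (1 = bad).
bins = ["low", "low", "mid", "mid", "high", "high", "high"]
target = [1, 1, 1, 0, 0, 0, 0]
table = woe_table(bins, target)
x_woe = [table[b] for b in bins]  # this column feeds the logistic regression
print(table)
```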
Mathematical methods and algorithms of business informatics
24–33
Elena R. Goryainova - Associate Professor, Department of Mathematics, Faculty of Economic Sciences, National Research University Higher School of Economics. Address: 20, Myasnitskaya Street, Moscow, 101000, Russian Federation. E-mail: el-goryainova@mail.ru
Julia A. Shalimova - Graduate Student, Faculty of Economic Sciences, National Research University Higher School of Economics. Address: 20, Myasnitskaya Street, Moscow, 101000, Russian Federation. E-mail: july.shalimova@yandex.ru
Methods of factor analysis are used to solve the problem of reducing a multidimensional vector of indicators. One of them is the maximum likelihood method (MLM). It identifies uncorrelated common factors among a set of correlated quantitative indicators; these common factors can represent the initial indicators without significant loss of information. Common factors are detected using a special representation of the correlation matrix of the observed indicators. However, the correlation coefficient is not defined for characteristics measured on a nominal scale. In addition, it cannot serve as a measure of the strength of association between indicators with a nonlinear dependence. Traditional methods of factor analysis are ineffective in such situations. Two MLM modifications are proposed in the paper. They use Spearman's rank correlation coefficients and Cramér's coefficients as measures of the relationship between variables. Using the Monte Carlo method, 12-dimensional vectors were simulated whose coordinates were related by linear and nonlinear dependencies. A comparative analysis of the effectiveness of the traditional MLM and the two proposed modifications was then carried out on these data. It is shown that only the adapted method using Cramér's coefficients is able to correctly combine indicators related by a nonmonotonic dependency into a common factor. On the other hand, this method is less efficient than the other two in cases where the dependency between variables is linear or monotonic. To demonstrate the efficiency of these methods on real data, the task of reducing the dimension of the dynamics of relative consumer price growth for a group of food products in the years 2008-2014 has been solved.
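Both proposed association measures are standard and straightforward to compute. A minimal sketch of each, using scipy (the sample data are invented; Spearman's coefficient captures monotonic dependence, while Cramér's coefficient works for nominal variables):

```python
import numpy as np
from scipy.stats import spearmanr, chi2_contingency

def cramers_v(x, y):
    """Cramér's coefficient for two nominal variables via a contingency table."""
    xi = {v: i for i, v in enumerate(sorted(set(x)))}
    yi = {v: i for i, v in enumerate(sorted(set(y)))}
    table = np.zeros((len(xi), len(yi)))
    for a, b in zip(x, y):
        table[xi[a], yi[b]] += 1
    chi2 = chi2_contingency(table, correction=False)[0]
    return np.sqrt(chi2 / (table.sum() * (min(table.shape) - 1)))

x = [1, 2, 3, 4, 5, 6]
y = [1, 4, 9, 16, 25, 36]  # nonlinear but monotonic
rho, _ = spearmanr(x, y)
print(rho)  # 1.0: Spearman captures the monotonic dependence

a = ["red", "red", "blue", "blue", "red", "blue"]
b = ["yes", "yes", "no", "no", "yes", "no"]
print(cramers_v(a, b))  # 1.0 for a perfectly associated pair
```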
Data analysis and intelligence systems
34–43
Vladimir V. Laptev - Associate Professor, Department of Engineering Graphics and Design, Institute of Metallurgy, Mechanical Engineering and Transport, St. Petersburg State Polytechnical University. Address: 29, Politekhnicheskaya Street, St. Petersburg, 195251, Russian Federation. E-mail: laptevsee@yandex.ru
Paul A. Orlov - Senior Lecturer, Department of Engineering Graphics and Design, Institute of Metallurgy, Mechanical Engineering and Transport, St. Petersburg State Polytechnical University; Senior Lecturer, Department of Media Design and Information Technologies, School of Journalism & Mass Communication, St. Petersburg State University. Address: 29, Politekhnicheskaya Street, St. Petersburg, 195251, Russian Federation. E-mail: paul.a.orlov@gmail.com
Data structures are common indicators in the fields of management and business. Infographics (serious graphics), a special area of communication design, provides a number of graphical ways to visualize this type of data. Each available chart type comes with certain limitations, associated with features of visual perception and with semiotic aspects. In our study, we chose the Sankey flow diagram because it has received insufficient scrutiny. This type of diagram is often used to represent data structure in business processes. We designed an eye-tracking study to identify methods for assessing the graphical form of data structure visualizations. In our experiment, we used a 4-flow Sankey diagram as a stimulus. Hierarchical divisive algorithms were taken as the method of analysis: such an algorithm starts from a universal cluster consisting of all gaze fixations and then partitions it stepwise into smaller clusters. It was found that there are at least four clusters based on the coordinates. In the resulting model, we identified an "input" cluster and an "output" cluster group, and clearly defined the central cluster of gaze fixations. Increasing the number of clusters changes the picture in the direction of greater detail. We show a certain narrative that is traced when viewing such charts: it reflects the sequence of the flows' movement from the whole to its structural parts. As a result, cluster analysis allows the visual interpretation of numerical data structures in a range of decision-support tasks that can be solved by software.
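Hierarchical divisive clustering of fixation coordinates can be sketched as a recursive two-way split, here using k-means at each step as a simplified stand-in for the authors' partitioning procedure; the fixation coordinates are invented:

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive(points, depth):
    """Recursively partition the universal cluster of gaze fixations in two."""
    if depth == 0 or len(points) < 2:
        return [points]
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
    return (divisive(points[labels == 0], depth - 1)
            + divisive(points[labels == 1], depth - 1))

# Invented (x, y) fixation coordinates over a 4-flow Sankey diagram.
rng = np.random.default_rng(0)
fixations = np.vstack([
    rng.normal((100, 300), 15, (30, 2)),  # "input" side of the diagram
    rng.normal((400, 300), 15, (30, 2)),  # central region
    rng.normal((700, 200), 15, (30, 2)),  # one "output" branch
    rng.normal((700, 400), 15, (30, 2)),  # another "output" branch
])
clusters = divisive(fixations, depth=2)   # yields up to four clusters
print([len(c) for c in clusters])
```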
44–54
Andrey V. Mokeyev - Senior Lecturer, Department of Information Systems, Faculty of Economics and Entrepreneurship, South Ural State University. Address: 76, Lenina prospect, Chelyabinsk, 454080, Russian Federation. E-mail: gr.smk@mail.ru
Vladimir V. Mokeyev - Head of Department of Information Systems, Faculty of Economics and Entrepreneurship, South Ural State University. Address: 76, Lenina prospect, Chelyabinsk, 454080, Russian Federation. E-mail: mokeyev@mail.ru
The solution of the face recognition problem by means of principal component analysis (PCA) and linear discriminant analysis (LDA) is considered. The main idea of this approach is that, firstly, we project the face image from the original vector space to a face subspace via PCA and, secondly, we use LDA to obtain a linear classifier. In the paper, the efficiency of the PCA+LDA approach to face recognition without preliminary processing (scaling, rotation, translation) is investigated. The research shows that the higher the number of images in a class of the training sample, the higher the face recognition rate. When the number of images is small, face recognition performance can be improved by expanding the training set with images obtained by scaling and rotating the initial images. The efficiency of the PCA+LDA approach is investigated on images from the ORL database. For processing large sets of images, the methods of linear condensation and principal component synthesis are suggested for calculating the principal components. The principal component synthesis method is based on splitting an initial image set into small sets of images, obtaining eigenvectors of these sets (particular solutions) and calculating the eigenvectors of the initial set from the particular solutions. The linear condensation method is based on reducing the order of a matrix, which allows fairly exact calculation of the eigenvectors whose eigenvalues lie in a preset interval. It is shown that the linear condensation and principal component synthesis methods significantly decrease the processing time for building a classifier with the PCA+LDA approach, without reducing the face recognition rate.
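The PCA+LDA pipeline itself is standard and can be sketched with scikit-learn; random vectors stand in for the ORL face images here, and the dimensions merely mimic that database (40 subjects, 10 images each, 92 x 112 = 10304 pixels):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

# Stand-in data shaped like ORL: 40 subjects x 10 images of 10304 pixels.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10304))
y = np.repeat(np.arange(40), 10)

# Step 1: project face vectors into a low-dimensional subspace via PCA.
# Step 2: build a linear classifier in that subspace via LDA.
model = make_pipeline(PCA(n_components=100), LinearDiscriminantAnalysis())
model.fit(X, y)
print(model.score(X, y))  # training accuracy on the stand-in data
```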
Information systems and technologies in business
55–64
Maria A. Anikanova - Cloud Solutions Specialist, Small and Medium Solutions and Partners Unit, Microsoft Russia. Address: 17/1, Krylatskaya Street, Moscow, 121614, Russian Federation. E-mail: v-maanik@microsoft.com
Alexander F. Morgunov - Associate Professor, Department of Corporate Information Systems, School of Business Informatics, Faculty of Business and Management, National Research University Higher School of Economics. Address: 20, Myasnitskaya Street, Moscow, 101000, Russian Federation. E-mail: amorgunov@hse.ru
The article investigates the possibility and viability of automating small companies' business processes using public cloud SaaS applications. One of the fundamental advantages of cloud solutions is IT infrastructure simplification, along with high scalability and rich functionality. Cloud counterparts of such "heavy" on-premise software as ERP or CRM systems do not require large financial investments and time expenditures: the platform is simpler and more agile, its support requires less effort, and IT specialists get an opportunity to refocus on more important projects. Another significant advantage of such solutions is that the major part of IT expenses can be converted from capital to operational costs, so that small businesses do not have to withdraw large amounts of money from corporate cash flow. The cost of SaaS applications is much lower than the one-time expenditure on implementing on-premise products. However, the cost of error for small organizations at the stage of decision making on IT infrastructure construction and management (including a SaaS-based architecture) is still high, since any further IT infrastructure changes will require significant additional costs and can turn out to be critical for the company's budget. That is why the article proposes a set of criteria that allows companies to assess the expediency of using public cloud applications when planning a small business IT infrastructure. The criteria are divided into three main groups: functional, financial and economic, and technical. They are described in detail separately and then ranked by importance using the expert evaluation method, involving recognized IT experts. Based on the quantitative estimates, a formula was developed that derives a specific index evaluating the reasonability of automating concrete business processes with public cloud SaaS applications. The article will be of interest to information systems integration specialists and to small business decision makers seeking to optimize IT costs.
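The abstract does not reproduce the formula itself; in essence such an index is a weighted sum of criterion scores with expert-assigned weights. A purely hypothetical sketch (the criterion groups follow the article; the scores and weights are invented):

```python
# Hypothetical scores (0..1) of one business process against the three
# criterion groups, and invented expert-assigned importance weights.
scores = {"functional": 0.8, "financial_economic": 0.6, "technical": 0.9}
weights = {"functional": 0.5, "financial_economic": 0.3, "technical": 0.2}

# Index of the reasonability of automating the process with a SaaS application.
index = sum(scores[c] * weights[c] for c in scores)
print(f"SaaS reasonability index: {index:.2f}")  # 0.76
```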
65–73
Sergey M. Yampolsky - Associate Professor, Department of Business Analytics, School of Business Informatics, Faculty of Business and Management, National Research University Higher School of Economics. Address: 20, Myasnitskaya Street, Moscow, 101000, Russian Federation. E-mail: syampolsky@hse.ru
Anatoly S. Shalamov - Researcher, Department of Statistical Problems of Informatics and Management, Institute of Informatics Problems, Russian Academy of Sciences. Address: 44, Vavilova Street, Moscow, 119333, Russian Federation. E-mail: a-shal5@yandex.ru
Alexander P. Kirsanov - Professor, Department of Business Analytics, School of Business Informatics, Faculty of Business and Management, National Research University Higher School of Economics. Address: 20, Myasnitskaya Street, Moscow, 101000, Russian Federation. E-mail: ki@hse.ru
Eugene V. Ogurechnikov - Senior Lecturer, Department of Business Analytics, School of Business Informatics, Faculty of Business and Management, National Research University Higher School of Economics. Address: 20, Myasnitskaya Street, Moscow, 101000, Russian Federation. E-mail: eogurechnikov@hse.ru
The article considers issues of technical product life cycle management in the organization and management of spare parts delivery within the framework of after-sales service. It examines a Petri net model describing the cause-effect relations between events linked to delivery planning and management, based on a probabilistic analytical model of after-sales service of technical products and a program-based risk analysis system using technical and economic criteria. The result of the model's operation is a plan that achieves an acceptable balance between the cost and quality of products and their current maintenance, including the detection and minimization of financial risks. An example illustrating automated planning of spare parts delivery is given. The dynamics of the number of technical products in operation is represented in an integrated graphical form, making it possible to predict the average serviceability factor of a technical product, determined both by the number of serviceable technical products in the customer's warehouse and by the productivity of repair agencies. The earned value method is shown to be an effective tool for risk analysis of schedule variance in spare parts delivery. Monitoring the earned value of finances permits forecasting not only the probability of successful completion of spare parts delivery, but also the risks of both cost and schedule variance. An example of automated risk analysis is provided. The degree of coincidence between actual cost and planned value is estimated by means of an effectiveness index, which is used to analyze the performance of the customer's subdivisions and to correct their further functioning. For a selected year, the effectiveness index can be defined and optimized for the predetermined serviceability factor assigned to every customer during automated planning of spare parts delivery. The approach presented in the article is quite universal, which opens an opportunity to apply it to product and service life cycle management problems in various organizational, technical and economic systems.
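The earned value indicators referred to above have standard definitions: with planned value PV, earned value EV and actual cost AC, schedule variance is EV - PV, cost variance is EV - AC, and the corresponding performance indices are SPI = EV/PV and CPI = EV/AC. A minimal sketch under those standard definitions (the delivery figures are invented):

```python
def earned_value_indices(pv, ev, ac):
    """Standard earned value metrics for a spare parts delivery plan.

    pv: planned value, ev: earned value, ac: actual cost (same currency).
    """
    return {
        "schedule_variance": ev - pv,  # negative: behind schedule
        "cost_variance": ev - ac,      # negative: over budget
        "SPI": ev / pv,                # schedule performance index
        "CPI": ev / ac,                # cost performance index
    }

# Invented figures for one quarter of spare parts deliveries.
print(earned_value_indices(pv=120.0, ev=105.0, ac=110.0))
```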
74–79
Rafael R. Sukhov - Finance Manager, INO Uptime Technology. Address: 6, Bolshoy Koptevskiy proezd, Moscow, 125315, Russian Federation. E-mail: r.sukhov@uptimetechnology.ru
Maxim B. Amzarakov - Director, INO Uptime Technology. Address: 6, Bolshoy Koptevskiy proezd, Moscow, 125315, Russian Federation. E-mail: m.amzarakov@uptimetechnology.ru
Eugene A. Isaev - Professor, Head of Department of Information Systems and Digital Infrastructure Management, School of Business Informatics, Faculty of Business and Management, National Research University Higher School of Economics; Head of Laboratory, P.N. Lebedev Physical Institute, Russian Academy of Sciences. Address: 20, Myasnitskaya Street, Moscow, 101000, Russian Federation. E-mail: eisaev@hse.ru
Svetlana V. Maltseva - Professor, Head of Department of Innovation and Business in Information Technologies, School of Business Informatics, Faculty of Business and Management, National Research University Higher School of Economics. Address: 20, Myasnitskaya Street, Moscow, 101000, Russian Federation. E-mail: smaltseva@hse.ru
The paper focuses on data centers and company assets, including their relationship and interaction. The purpose of the article is to give an idea of how a data center may affect a company's assets and their final value. The aspects that are important for understanding why companies are interested in forming such an investment object and subsequently accounting for these investments as a significant part of the company's assets are discussed. It is justified that in some enterprises a data center is itself an important asset, and in some business models it is a key asset of the company. Relying on the definitions of the terms "assets" and "data center", the article discusses the variants of a data center's participation in the business of an enterprise and its influence on the company's final value through the company's assets. The article presents examples of how a data center becomes the subject of production in large enterprises whose business is based on the storage and processing of information and on the delivery of services related to access to this information. Examples of such companies representing different industries are provided. The issues of statutory regulation of requirements related to the establishment of data centers for the purpose of performing regulatory functions are considered, as are questions of corporate security and the impact of data centers on information safety. Certain attention is paid to the indirect influence of a data center on the value of a company's assets: by improving data reliability and the security of stored and processed data, it affects the market value of the enterprise as a business through increased consumer confidence.